Abstract

Standard multi-armed bandits model decision problems in which the 
consequences of each action choice are unknown and independent of each 
other. But in a wide variety of decision problems - from drug dosage 
to dynamic pricing - the consequences (rewards) of different actions are 
correlated, so that selecting one action provides information about the 
consequences (rewards) of other actions as well. We propose and analyze 
a class of models of such decision problems; we call this class of models 
global bandits. When rewards across actions (arms) are sufficiently
correlated we construct a greedy policy that achieves bounded regret, with a
bound that depends on the true parameters of the problem. In the special
case in which rewards of all arms are deterministic functions of a single
unknown parameter, we construct a (more sophisticated) greedy policy
that achieves bounded regret, with a bound that depends on the single
true parameter of the problem. For this special case we also obtain a
bound on regret that is independent of the true parameter; this bound is
sub-linear, with an exponent that depends on the informativeness of the
arms (which measures the strength of correlation between arm rewards).


1 Introduction 


Multi-armed bandits provide powerful models and algorithms for decision
problems in which the consequences of each action choice are unknown. The standard
analysis of multi-armed bandits assumes that consequences of each action choice
are independent of each other. But in a wide variety of decision problems - from
drug dosage to dynamic pricing - the consequences (rewards) of different
actions are correlated, so that selecting one action provides information about the
consequences (rewards) of other actions as well. In this paper we propose and
analyze a class of models of such decision problems; we call this class of models
global bandits.

We begin by constructing a model of globally informative multi-armed
bandits (GI-MAB) that formalizes the idea that rewards across actions (arms) are
correlated. For GI-MABs, we construct a greedy policy that achieves bounded
regret, with a bound that depends on the true parameters of the problem.

We then focus on a more restricted class of global bandits in which the 
expected rewards of all arms are functions of a single unknown parameter; we 
call these globally parametrized multi-armed bandits (GP-MAB). For GP-MABs, 
we construct a (more sophisticated) greedy policy that again achieves bounded 
regret, with a bound that depends on the single true parameter of the problem. 
In particular, we also obtain a bound on regret that is independent of the true
parameter; this bound is sub-linear, with an exponent that depends on the
informativeness of the arms (informativeness is a measure of the strength of
correlation between arm rewards).

GP-MABs encompass the model studied in (Mersereau et al. 2009), in which
it is assumed that the expected rewards of each arm are known linear functions
of a single unknown global parameter, and which proves that a greedy policy
achieves bounded regret. In this paper we consider a more general model in
which the expected rewards of each arm are known, Hölder continuous, possibly
non-linear functions of a single unknown global parameter. (Thus, our model
includes linear reward functions as considered in (Mersereau et al. 2009) as a
special case.) Allowing for non-linear reward functions significantly complicates
the learning problem. If reward functions are linear, then the additional
information that can be inferred about the rewards of arm X by an additional
single sample of the reward from arm Y is independent of the history of previous
samples from arm Y. (The additional information about the rewards of arm X
that can be inferred from obtaining sample reward r from arm Y is the same as
the additional information about X that could be inferred from obtaining the
sample reward L(r) from arm X itself, where L is a linear function that depends
only on the reward functions themselves.) If reward functions are non-linear,
then the additional information that can be inferred about the rewards of arm
X by a single sample of the reward from (a different) arm Y results in a biased
estimate. Therefore, we need to incorporate the previous samples of arms X
and Y to ensure that the bias asymptotically converges to 0.
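
The bias described above can be illustrated with a small simulation. The sketch below is not taken from the paper: the reward functions 0.8 * theta and theta ** 2, the Bernoulli reward model, and the value theta_star = 0.5 are all assumptions chosen for illustration. It compares inverting each reward sample on its own against inverting the pooled sample mean.

```python
import math
import random


def bernoulli(p, rng):
    """Draw a reward in {0, 1} with success probability p."""
    return 1.0 if rng.random() < p else 0.0


def average_of_inverted_samples(inverse, p, n, rng):
    """Invert every reward on its own, then average the inverted values."""
    return sum(inverse(bernoulli(p, rng)) for _ in range(n)) / n


def inverse_of_averaged_samples(inverse, p, n, rng):
    """Average the rewards first, then invert the sample mean once."""
    return inverse(sum(bernoulli(p, rng) for _ in range(n)) / n)


rng = random.Random(0)
theta_star = 0.5  # hypothetical true global parameter
n = 20000

# Linear arm (assumed): mu(theta) = 0.8 * theta, so the inverse is r / 0.8.
linear_estimate = average_of_inverted_samples(
    lambda r: r / 0.8, 0.8 * theta_star, n, rng
)

# Non-linear arm (assumed): mu(theta) = theta ** 2, so the inverse is sqrt(r).
biased_estimate = average_of_inverted_samples(math.sqrt, theta_star ** 2, n, rng)
pooled_estimate = inverse_of_averaged_samples(math.sqrt, theta_star ** 2, n, rng)

print(linear_estimate, biased_estimate, pooled_estimate)
```

Under these assumptions, the per-sample inversion is unbiased for the linear arm, whereas for the quadratic arm it concentrates around 0.25 rather than 0.5 (since the square root of a 0/1 reward is the reward itself). Inverting the pooled sample mean removes the bias asymptotically, which is why the history of samples has to be kept.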

Many applications can be formalized as a GI-MAB or a GP-MAB. Examples
include clinical trials involving similar drugs (e.g., drugs with a similar chemical
composition) or treatments which may have similar effects on the patients;
hence, the outcome of administering one drug/treatment to a patient yields
information about the outcome of administering a similar drug/treatment to
that patient.


Example 1: Consider first drug dosage, which is studied in (Lai and Robbins
1978). Let $x_i$ be the dosage level of the drug for patient $i$ and $y_i$ be the
response of patient $i$. The relationship between the drug dosage and patient
response is modeled in (Lai and Robbins 1978) as
$y_i = M(x_i; \theta_*) - c(x_i) + \epsilon_i$,
where $M(\cdot)$ is the response function, $\theta_*$ is the slope if the function is linear or
the elasticity if the function is exponential or logistic, $c(x_i)$ is the cost of
the dosage level of the drug, and $\epsilon_i$ is i.i.d. zero mean noise. For this model, $\theta_*$
becomes the global parameter and the set of drug dosage levels becomes the set
of arms.
Example 2: In dynamic pricing, an agent sequentially selects a price from
a finite set of prices $\mathcal{P}$ with the objective of maximizing its revenue over a finite
time horizon (Chen and Farias 2013). At time $t$, the agent first selects a price
$p \in \mathcal{P}$, and then observes the amount of sales, which is given as
$S_{p,t}(\Lambda) = F_p(\Lambda) + \epsilon_t$, where $F(\cdot)$ is the modulating function and $\epsilon_t$ is the noise term with
zero mean. The modulating function is the purchase probability of an item
of price $p$ given the market size $\Lambda$. In this example, the market size is the
global parameter; this is unknown and needs to be learned by setting prices
and observing the sales related to the set price. Commonly used modulating
functions include exponential and logistic functions.
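
To make the role of the global parameter in Example 2 concrete, the following sketch shows how sales observed at a single price can be used to predict sales at the other prices. Everything specific here is an assumption for illustration: the logistic form of `purchase_prob`, the price set, the uniform noise, and the market size 2.0 are hypothetical and not taken from (Chen and Farias 2013).

```python
import math
import random


def purchase_prob(price, market_size):
    """Hypothetical logistic modulating function F_p(market_size)."""
    return 1.0 / (1.0 + math.exp(price - market_size))


def invert_purchase_prob(price, prob):
    """Solve purchase_prob(price, m) = prob for the market size m."""
    prob = min(max(prob, 1e-6), 1.0 - 1e-6)  # keep the logit finite
    return price + math.log(prob / (1.0 - prob))


rng = random.Random(1)
true_market_size = 2.0  # unknown to the agent in the actual problem
prices = [1.0, 2.0, 3.0]
observed_price = prices[0]
n = 5000

# Noisy sales observed at one price only (zero-mean uniform noise, assumed).
average_sales = sum(
    purchase_prob(observed_price, true_market_size) + rng.uniform(-0.1, 0.1)
    for _ in range(n)
) / n

# One arm's observations yield an estimate of the global parameter,
# which in turn predicts the expected sales of every other price.
estimated_market_size = invert_purchase_prob(observed_price, average_sales)
predicted_sales = {p: purchase_prob(p, estimated_market_size) for p in prices}
print(estimated_market_size, predicted_sales)
```

The point of the sketch is only that samples from one arm are informative about all arms once the modulating functions are known.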

In summary, the main contributions of our paper are: 

• We propose a new MAB model, called the GI-MAB, in which selecting an
arm reveals information about the rewards of all the arms. We show that
GI-MAB represents a generalization of GP-MAB, and GP-MAB includes
linearly parameterized MAB (Mersereau et al. 2009) as a special case.
Under a mild assumption on the correlation between the expected arm
rewards, we show that the regret of the greedy policy for the GI-MAB is
bounded.


• For GP-MAB, we propose a greedy policy that always selects the arm with 
the highest estimated expected reward. We prove that the greedy policy 
achieves bounded regret. 

• In addition to proving that the regret is bounded (which is related to the
asymptotic behavior of the proposed policies), we also show how the
regret increases over time by identifying and characterizing three regimes of
growth: first, the regret increases at most sub-linearly over time until a
first threshold after which it increases at most logarithmically over time
until a second threshold, before converging to a finite regret. These
thresholds have the property that they are decreasing in the informativeness of
the arms.


• We prove a sub-linear in time worst-case regret bound for the greedy
policy, which does not depend on the value of the global parameter and hence
holds for any global parameter value. The rate of increase of the regret
in time decreases with the informativeness of the arms, implying that the
regret increases more slowly when the informativeness is high. Moreover, we
also provide a matching lower bound on the worst-case regret.


• Given a distribution over the set of global parameter values, we prove a
Bayesian risk bound that depends on the informativeness. When the arms
are fully informative, such as in the case of linearly parametrized bandits
(Mersereau et al. 2009), our Bayesian risk bound and our proposed greedy
policy reduce to the well known Bayesian risk bound and the greedy policy
in (Mersereau et al. 2009), respectively.

• We also study a non-stationary version of the GP-MAB, where the value of
the global parameter slowly changes over time. We characterize the rate of
change of the parameter value by introducing a stability constant, which is
the inverse of the speed of change of the value of the global parameter, and
prove a bound on the time-averaged regret that depends on the stability
constant and the form of the parametric reward functions.

The remainder of the paper is organized as follows. In Section 2, we
describe the related work and highlight the differences with respect to our work.
In Section 3, we formulate the GI-MAB and prove that bounded regret can be
achieved by the greedy policy. In Section 4, we formulate the GP-MAB and
propose a variant of the greedy policy that also achieves bounded regret,
with a bound that depends on the value of the global parameter. For this policy, we also
prove sub-linear in time problem-independent regret bounds and Bayesian risk
bounds that depend on the informativeness of the arms. In Section 5, we provide
regret analysis for the case when the global parameter is time-varying. Concluding
remarks are given in Section 6.


2 Related Work 

Numerous variants of the MAB have been defined and investigated in the past
decade - these include stochastic bandits (Lai and Robbins 1985; Auer et al.
2002; Auer 2002; Garivier and Cappé 2011; Rajeev et al. 1989), Bayesian
bandits (Kaufmann et al. 2012; Thompson 1933; Agrawal and Goyal 2012; Korda
et al. 2013; Bubeck and Liu 2013), contextual bandits (Langford and Zhang
2008; Slivkins 2014; Agrawal and Goyal 2013), combinatorial bandits (Gai
et al. 2012), and many others. Instead of comparing our method to all these
MAB variants, we group the existing literature depending on their relationship
to the theme of this paper: how the informativeness of an arm can be exploited 
to learn about the rewards of other arms. We call a MAB non-informative if 
the reward observations of any arm do not reveal any information about the 
expected rewards of any other arms. Examples of non-informative MABs are the
stochastic bandits (Lai and Robbins 1985; Auer et al. 2002) and the bandits
with local parameters (Agrawal and Goyal 2012; Kaufmann et al. 2012). In
these problems the regret grows at least logarithmically in time, since each arm
should be selected at least logarithmically many times to identify the optimal 
arm. We call a MAB group-informative if the reward observations from an arm 
provide information about the rewards of a known group of other arms but not 
all the arms. Examples of group-informative MABs are combinatorial bandits 
(Gai et al. 2012), contextual bandits (Langford and Zhang 2008; Slivkins 2014;
Agrawal and Goyal 2013) and structured bandits (Rusmevichientong and
Tsitsiklis 2010; Filippi et al. 2011). In these problems the regret grows at least
logarithmically over time since at least one suboptimal arm should be selected at
least logarithmically many times to identify groups of arms that are suboptimal. 
We call a MAB problem globally-informative if the reward observations from an
arm provide information about the rewards of all the arms. The proposed
GP-MABs include the linearly-parametrized MABs in (Mersereau et al. 2009) as a
subclass. Hence, our results reduce to the results of (Mersereau et al. 2009) for
the special case when expected arm rewards are linear in the parameter.

Another related work is (Russo and Van Roy 2015), in which the authors
prove regret bounds that depend on the learner’s uncertainty about the optimal
action. This uncertainty depends on the learner’s prior knowledge and prior
observations, and affects the constant factors that contribute to the $O(\sqrt{T})$ regret
bound. In contrast, in our problem formulation, we show that the strong
dependence of the arms results in a bounded problem specific regret and a sub-linear
in time worst-case, i.e., problem independent, regret whose time order depends
on the informativeness of the arms (the strength of the correlation among the
arms).


In (Gittins 1989), a MAB problem is considered in which the arm rewards
are parametrized with known priors and the aim is to maximize the discounted
sum of future rewards. However, in Gittins’ formulation of the MAB, the
parameters of the arms are different from each other, and the discounting
allows the learner to efficiently solve the optimization problem to determine the
optimal arm by decoupling the joint optimization problem into K individual
optimization problems - one for each arm. In contrast, we do not assume known
priors, and the learner in our case does not solve an optimization problem but
rather learns the global parameter through its reward observations.


Another seemingly related learning scenario is the experts setting
(Cesa-Bianchi et al. 1993), where after an arm is chosen, the rewards of all arms
are observed and their estimated rewards are updated. Hence, since there is
no tradeoff between exploration and exploitation, finite regret bounds can be
achieved in such expert settings with a finite number of arms and stochastic arm
rewards. However, unlike in the expert setting, the GI-MABs and GP-MABs
achieve finite regret bounds while observing only the reward of the selected arm.
Hence, the arm reward estimation procedure in GI-MABs and GP-MABs
requires forming reward estimates by collectively considering the observed rewards
from all the arms. This is completely different from the expert settings, in
which the expected reward of an arm is estimated only by using the past reward
observations from that arm.


3 Globally Informative Multi-Armed Bandits (GI-MAB)

The set of all arms is denoted by $\mathcal{K}$ and the number of arms is $K := |\mathcal{K}|$, where
$|\cdot|$ is the cardinality operator. The reward obtained by playing an arm $k \in \mathcal{K}$
at time $t$ is given by a random variable $X_{k,t}$. We assume that for $t \geq 1$ and for
each arm $k \in \mathcal{K}$, $X_{k,t}$ is drawn independently from an unknown distribution
with support $[0,1]$, where $E[X_{k,t}] = u_k$ is the unknown parameter of the arm $k$
with $E[\cdot]$ denoting the expectation.¹

3.1 Definitions of joint and marginal parameter distributions and estimator functions

Let $u$ denote the vector of the parameters of all arms where $u = (u_1, \ldots, u_K) \in [0,1]^K$,
$u_{-i}$ denote the vector of the parameters of all arms except arm $i$, and
$u_{-i,j}$ denote the vector of the parameters of all arms except arms $i,j$. We
assume that the parameters of the arms are drawn a priori from an unknown
joint distribution $\nu(u)$. The marginal distributions of the parameters are defined
as follows:

$$\nu_i(u_i) := \int \nu(u_i, u_{-i}) \, du_{-i}, \qquad \nu_{i,j}(u_i, u_j) := \int \nu(u_i, u_j, u_{-i,j}) \, du_{-i,j}.$$

The exact parameter values, and the joint and marginal parameter distributions
are unknown to the learner. However, the learner knows the estimator function
$f_{i,j}$ for each pair of arms, which is given as

$$f_{i,j}(u) := E[u_i \mid u_j = u] = \int u_i \, \frac{\nu_{i,j}(u_i, u)}{\nu_j(u)} \, du_i,$$

where $E[\cdot \mid \cdot]$ denotes the conditional expectation. Basically, the estimator
function expresses the learner’s knowledge of the correlation of rewards.

Remark 1. Note that $f_{i,j}$ is the best Mean-Squared Error (MSE) estimator
of the parameter of arm $i$ given the knowledge of the parameter of arm $j$.


Note that knowing the parameter $u_i$ does not, in general, give exact information
about the parameter $u_j$. Having additional information
on the estimator functions enables the learner to estimate the parameters of
the arms which are not selected. Dynamic pricing is an example application
where the learner has a priori knowledge about the estimator functions. For
instance, the revenue obtained by setting a price is informative about the market
size $\Lambda$ and therefore about the revenue obtained by setting another price (Chen and
Farias 2013).


To make the analysis tractable, we impose the following Hölder continuity
assumption on the estimator functions.

Assumption 1. For each pair of arms $(i,j)$ such that $i \neq j$, there exist a
constant $D > 0$ and an exponent $\gamma > 0$ such that $|f_{i,j}(u) - f_{i,j}(u')| \leq D |u - u'|^{\gamma}$.


¹The set $[0,1]$ is just a convenient normalization. In general, it is only needed that the
distribution has a bounded support. 
3.2 The benchmark solution and definition of regret

The learner’s goal is to maximize its cumulative reward up to any time $T$. If
the expected rewards were known by the learner, it would always select
the arm $k^* \in \arg\max_{k \in \mathcal{K}} u_k$. We call the policy that selects the arm with the
highest parameter the oracle policy and denote the expected reward of its best
arm as $u^* := \max_{k \in \mathcal{K}} u_k$. The one-step regret is the difference between the
expected reward of the oracle policy and that of our policy, which selects arm $I_t \in \mathcal{K}$
at time $t$; it is denoted by $r_t(u) := u^* - u_{I_t}$. The cumulative regret is the
expected total loss of the learner by time $T$, which is given by

$$\mathrm{Reg}(u, T) := \sum_{t=1}^{T} r_t(u).$$


Any regret which increases sub-linearly in time guarantees convergence to the 
oracle policy in terms of the time averaged reward. We propose a greedy policy 
which achieves a bounded regret (independent of time horizon T) under a mild 
assumption on the structure of the correlation between the expected rewards of 
the arms which will be described in the next section. 
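
As a concrete reading of the regret definition above, the following minimal sketch accumulates the one-step regret $u^* - u_{I_t}$ along a sequence of arm selections. The parameter vector and the selection sequence are hypothetical values chosen for illustration.

```python
def cumulative_regret(parameters, selections):
    """Running sum of the one-step regrets u* - u_{I_t}."""
    best = max(parameters)  # expected reward of the oracle policy
    total = 0.0
    history = []
    for arm in selections:
        total += best - parameters[arm]
        history.append(total)
    return history


u = [0.3, 0.7, 0.5]          # hypothetical arm parameters
selections = [0, 1, 2, 1, 1]  # hypothetical sequence of selected arms I_t
regret_curve = cumulative_regret(u, selections)
print(regret_curve)
```

In this example the curve stops growing once the policy keeps selecting the best arm, which is the behavior a bounded-regret policy exhibits after finitely many mistakes.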


3.3 Weighted-arm greedy policy for GI-MAB and its regret analysis

In this section, we propose a Weighted-Arm Greedy Policy (WAGP) which exploits
the knowledge of the estimator functions. Let $N_k(t)$ denote the number of times
that arm $k$ has been selected by time $t$. For each arm, we keep two estimates:
$\hat{u}_k$ is the sample mean estimator of arm $k$, i.e.,

$$\hat{u}_k = \frac{N_k(t-1)\,\hat{u}_k + X_{k,t}}{N_k(t-1) + 1} \tag{1}$$

and $\hat{u}'_k$ is the combined sample mean estimator of arm $k$, i.e.,

$$\hat{u}'_k = \frac{\sum_{i=1, i \neq k}^{K} N_i(t)\, f_{k,i}(\hat{u}_i) + N_k(t)\,\hat{u}_k}{t}. \tag{2}$$

The selection of WAGP is the arm with the highest estimated parameter, i.e.,
$I_t \in \arg\max_{k \in \mathcal{K}} \hat{u}'_k$. The pseudocode is given in Figure 1.

By using the estimator functions for the arms, we avoid the exploration phase
of the standard MAB algorithms such as UCB1 (Auer et al. 2002). Instead,
WAGP always selects the arm with the highest estimated parameter, where the
parameter estimates are formed by only considering the reward observations
from the most selected arm.

Next, we define the informativeness of the estimator functions. 


Definition 1. We say that the estimator functions $f_{i,j}$ are $\epsilon$-informative for a
pair of arms $(i,j)$ if $|f_{i,j}(u_j) - u_i| < \epsilon$ almost surely for all $u \sim \nu(u)$.
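
Input: $f_{i,j}$ for all arm pairs $(i,j)$
Notation: $\hat{u}' = (\hat{u}'_k)_{k=1}^{K}$, $\hat{u} = (\hat{u}_k)_{k=1}^{K}$
Initialization: $N_k(0) = 0$ for all $k \in \mathcal{K}$, $\hat{u}' = 0$, $\hat{u} = 0$
while $t \geq 1$ do
    if $t = 1$ then
        Randomly select arm $I_t$ from the set $\mathcal{K}$
    else
        Select the arm $I_t \in \arg\max_{k \in \mathcal{K}} \hat{u}'_k$
    end if
    Observe the reward $X_{I_t,t}$
    Update $\hat{u}_{I_t} = \frac{N_{I_t}(t-1)\,\hat{u}_{I_t} + X_{I_t,t}}{N_{I_t}(t-1) + 1}$
    Update $N_{I_t}(t) = N_{I_t}(t-1) + 1$ and $N_k(t) = N_k(t-1)$ for all $k \in \mathcal{K} \setminus I_t$
    Update $\hat{u}'_k = \frac{\sum_{i \neq k} N_i(t)\, f_{k,i}(\hat{u}_i) + N_k(t)\,\hat{u}_k}{t}$ for all $k \in \mathcal{K}$
end while
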

Figure 1: Pseudocode of the weighted arm greedy policy for GI-MAB. 
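
The steps of Figure 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation. The two-arm instance at the bottom is hypothetical: it assumes Bernoulli rewards and perfectly anti-correlated parameters $u_2 = 1 - u_1$, so that the estimator functions are $f_{1,2}(u) = f_{2,1}(u) = 1 - u$.

```python
import random


def wagp_gi_mab(estimators, sample_reward, num_arms, horizon, rng):
    """Weighted-arm greedy policy; estimators[k][i] plays the role of f_{k,i}."""
    counts = [0] * num_arms          # N_k(t)
    sample_means = [0.0] * num_arms  # sample mean estimates, update (1)
    combined = [0.0] * num_arms      # combined estimates, update (2)
    selections = []
    for t in range(1, horizon + 1):
        if t == 1:
            arm = rng.randrange(num_arms)
        else:
            arm = max(range(num_arms), key=lambda k: combined[k])
        reward = sample_reward(arm)
        sample_means[arm] = (counts[arm] * sample_means[arm] + reward) / (counts[arm] + 1)
        counts[arm] += 1
        for k in range(num_arms):
            total = counts[k] * sample_means[k]
            for i in range(num_arms):
                # Arms never played carry zero weight, so they are skipped.
                if i != k and counts[i] > 0:
                    total += counts[i] * estimators[k][i](sample_means[i])
            combined[k] = total / t
        selections.append(arm)
    return selections, combined


# Hypothetical instance: u_2 = 1 - u_1, hence f_{1,2}(u) = f_{2,1}(u) = 1 - u.
estimators = [
    [None, lambda u: 1.0 - u],
    [lambda u: 1.0 - u, None],
]
true_parameters = [0.3, 0.7]
rng = random.Random(0)


def sample_reward(arm):
    """Bernoulli reward with mean true_parameters[arm] (an assumption)."""
    return 1.0 if rng.random() < true_parameters[arm] else 0.0


selections, combined = wagp_gi_mab(estimators, sample_reward, 2, 2000, rng)
print(selections.count(1), combined)
```

In this instance every observation, whichever arm it comes from, contributes evidence about both arms, so the policy settles on the better arm without a dedicated exploration phase.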


Let $\delta := \min_{k \in \mathcal{K} \setminus k^*} (u^* - u_k)$ denote the suboptimality gap between the
expected reward of the best arm and the second best arm. The following theorem
shows that the weighted-arm greedy policy achieves bounded regret when the
estimator functions are sufficiently informative.

Theorem 1. If the estimator functions $f_{i,j}$ are $\delta/2$-informative for all pairs of arms
$(i,j)$, then the cumulative regret incurred by the weighted-arm greedy policy is bounded
for any $u \sim \nu(u)$, i.e., $\lim_{T \to \infty} \mathrm{Reg}(u, T) < \infty$.

The result of Theorem 1 can be interpreted as follows. By the Chernoff-Hoeffding
inequality, the probability that the parameter estimate of an arm deviates from
its true value decays exponentially fast in the number of samples. Due to this,
if the estimator functions $f_{i,j}$ for all pairs $(i,j)$ give sufficient information
about the expected rewards of the other arms (by being $\delta/2$-informative), the
weighted-arm policy achieves bounded regret.

In the next section, we introduce a special case of the GI-MAB called the globally
parametrized multi-armed bandit (GP-MAB) problem. Due to its structure, we are
able to show worst-case and Bayesian risk bounds on the regret of the GP-MAB
in addition to a bounded problem specific regret.


4 Globally Parametrized Multi-Armed Bandits (GP-MAB)

4.1 Problem Formulation 

As in the previous section, $\mathcal{K}$ denotes the set of arms and $X_{k,t}$ denotes the
reward of arm $k$ at time $t$. We assume that for $t \geq 1$ and $k \in \mathcal{K}$, $X_{k,t}$ is
drawn independently from an unknown distribution $\nu_k(\theta_*)$ with support $[0,1]$,
where $\theta_*$ is an unknown single-dimensional global parameter belonging to the set
$\Theta$, which we take as the unit interval for simplicity of exposition. The expected
reward of an arm $k \in \mathcal{K}$ is a Hölder continuous, invertible function of the global
parameter, which is given by $\mu_k(\theta_*) := E_{\nu_k(\theta_*)}[X_{k,t}]$, where $E_{\nu}[\cdot]$ denotes the
expectation taken with respect to distribution $\nu$. The learner knows the
expected reward of each arm as a function of the global parameter, i.e., the
function $\mu_k(\cdot)$ for each $k \in \mathcal{K}$.

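Assumption 2. (i) For each $k \in \mathcal{K}$, the reward function $\mu_k$ is invertible on
$[0,1]$.

(ii) For each $k \in \mathcal{K}$ and $y, y' \in [0,1]$, there exist $D_{1,k} > 0$ and $0 < \gamma_{1,k} \leq 1$
such that $|\mu_k^{-1}(y) - \mu_k^{-1}(y')| \leq D_{1,k} |y - y'|^{\gamma_{1,k}}$, where $\mu_k^{-1}$ is the inverse reward
function for arm $k$.

(iii) For each $k \in \mathcal{K}$ and $\theta, \theta' \in \Theta$, there exist $D_{2,k} > 0$ and $0 < \gamma_{2,k} \leq 1$ such
that $|\mu_k(\theta) - \mu_k(\theta')| \leq D_{2,k} |\theta - \theta'|^{\gamma_{2,k}}$.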

Assumption 2 ensures that the reward obtained from an arm can be used
to update the estimated expected rewards of the other arms. The last two
conditions are Hölder conditions on the reward and inverse reward functions,
which enable us to define the informativeness. Let $\gamma_1$ and $D_2$ be the maximum
of the constants $\gamma_{1,k}$ and $D_{2,k}$, and $D_1$ and $\gamma_2$ be the minimum of exponents $D_{1,k}$
and $\gamma_{2,k}$, respectively. It is worth noting that Assumption 2 is an assumption
about the individual reward functions $\mu_k$. Assumption 2 is mild and is satisfied
by the following reward functions: (i) exponential functions such as $\mu_k(\theta) =
a \exp(b\theta)$ for some $a > 0$, (ii) linear and piecewise linear functions, and (iii)
sub-linear and super-linear functions in $\theta$ which are invertible in $\theta$ such as
$\mu_k(\theta) = a\theta^{\gamma}$ with $\gamma > 0$ for $\Theta = [0,1]$.

Proposition 1. Under Assumption 2, the estimator functions are $\delta/2$-informative.

It follows from Proposition 1 that there exists a policy for which the problem
specific regret is bounded when Assumption 2 holds.

Let $k^*(\theta_*) := \arg\max_{k \in \mathcal{K}} \mu_k(\theta_*)$ be the set of optimal arms and $\mu^*(\theta_*) :=
\max_{k \in \mathcal{K}} \mu_k(\theta_*)$ be the expected reward of the optimal arm for the true value
of the global parameter $\theta_*$. The cumulative regret of a learning algorithm which
selects arm $I_t$ at each time $t$ until time horizon $T$ is defined as

$$\mathrm{Reg}(\theta_*, T) := \sum_{t=1}^{T} r_t(\theta_*),$$

where $r_t(\theta_*)$ is the one-step regret given by $r_t(\theta_*) := \mu^*(\theta_*) - \mu_{I_t}(\theta_*)$ for global
parameter $\theta_*$. In the following sections we will derive regret bounds both as
a function of $\theta_*$ (problem specific regret) and independent of $\theta_*$ (worst-case
regret).

4.2 Weighted-Arm Greedy Policy (WAGP) 

In this section, we propose a WAGP for the GP-MAB problem, which selects 
the arm with the highest estimated expected reward at each time t. In contrast 
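
Input: $\mu_k$ for each $k \in \mathcal{K}$
Initialization: $w_k(0) = 0$, $\hat{\theta}_{k,0} = 0$, $N_k(0) = 0$ for all $k \in \mathcal{K}$
while $t \geq 1$ do
    if $t = 1$ then
        Randomly select arm $I_t$ from the set $\mathcal{K}$
    else
        Select the arm $I_t \in \arg\max_{k \in \mathcal{K}} \mu_k(\hat{\theta}_{t-1})$
    end if
    Observe the reward $X_{I_t,t}$
    $\hat{X}_{k,t} = \hat{X}_{k,t-1}$ for all $k \in \mathcal{K} \setminus I_t$
    $\hat{X}_{I_t,t} = \frac{N_{I_t}(t-1)\,\hat{X}_{I_t,t-1} + X_{I_t,t}}{N_{I_t}(t-1) + 1}$
    for $k \in \mathcal{K}$ do
        if $\exists\, \theta \in \Theta$ such that $\mu_k(\theta) = \hat{X}_{k,t}$ then
            $\hat{\theta}_{k,t} = \mu_k^{-1}(\hat{X}_{k,t})$
        else
            $\hat{\theta}_{k,t} = \arg\min_{\theta \in \Theta} |\mu_k(\theta) - \hat{X}_{k,t}|$
        end if
    end for
    $N_{I_t}(t) = N_{I_t}(t-1) + 1$
    $N_k(t) = N_k(t-1)$ for all $k \in \mathcal{K} \setminus I_t$
    $w_k(t) = N_k(t)/t$ for all $k \in \mathcal{K}$
    $\hat{\theta}_t = \sum_{k=1}^{K} w_k(t)\,\hat{\theta}_{k,t}$
end while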

Figure 2: Pseudocode of the WAGP. 


to previous works in MABs (Auer et al. 2002; Lai and Robbins 1985) in which
the expected reward estimate of an arm only depends on the reward
observations from that arm, the proposed greedy policy constructs an estimated global
parameter, given by $\hat{\theta}_t = \sum_{k=1}^{K} w_k(t)\,\hat{\theta}_{k,t}$, where $w_k(t)$ is the weight of arm $k$
at time $t$ and $\hat{\theta}_{k,t}$ is the estimate of the global parameter based only on the
reward observations from arm $k$ until time $t$. Let $\mathcal{X}_{k,t}$ denote the set of rewards
obtained from the selections of arm $k$ by time $t$, i.e., $\mathcal{X}_{k,t} = (X_{k,\tau})_{\tau \leq t \,|\, I_\tau = k}$,
and $\hat{X}_{k,t}$ be the sample mean estimate of the rewards obtained from arm $k$ by
time $t$, i.e., $\hat{X}_{k,t} := (\sum_{x \in \mathcal{X}_{k,t}} x) / |\mathcal{X}_{k,t}|$. The proposed greedy policy operates as
follows for any time $t \geq 2$: (i) the arm with the highest expected reward according
to the estimated parameter $\hat{\theta}_{t-1}$ is selected, i.e., $I_t \in \arg\max_{k \in \mathcal{K}} \mu_k(\hat{\theta}_{t-1})$, (ii)
the reward $X_{I_t,t}$ is obtained and the individual reward estimates $\hat{X}_{k,t}$ are
updated for $k \in \mathcal{K}$, (iii) the individual estimate of the global parameter for each
arm $k$ is updated in the following way: if there is no global parameter $\theta \in \Theta$
such that $\mu_k(\theta) = \hat{X}_{k,t}$ for arm $k$, then the individual estimate for arm $k$
is updated in such a way that the gap between $\mu_k(\hat{\theta}_{k,t})$ and $\hat{X}_{k,t}$ is minimized,
i.e., $\hat{\theta}_{k,t} = \arg\min_{\theta \in \Theta} |\mu_k(\theta) - \hat{X}_{k,t}|$; otherwise, the individual estimate of arm $k$
is updated as $\hat{\theta}_{k,t} = \mu_k^{-1}(\hat{X}_{k,t})$, (iv) the weight of each arm $k$ is updated as
$w_k(t) = N_k(t)/t$, where $N_k(t)$ is the number of times the arm $k$ is played until
time $t$. For $t = 1$, since there is no global parameter estimate, the greedy policy
selects randomly among the set of arms. The pseudocode of the greedy policy
is given in Fig. 2.
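
The operation of the greedy policy described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: rewards are taken to be Bernoulli, the reward functions are assumed monotone and continuous on $[0,1]$ so that the inversion step can be done by bisection, and the three reward functions and the value `theta_star = 0.2` are example choices. The helper `estimate_parameter` is a hypothetical name; it returns the exact inverse when one exists and the closest endpoint otherwise, which matches the two cases of step (iii).

```python
import math
import random


def estimate_parameter(mu, x, tol=1e-9):
    """argmin over theta in [0, 1] of |mu(theta) - x|, for monotone continuous mu."""
    increasing = mu(1.0) >= mu(0.0)
    low_value, high_value = (mu(0.0), mu(1.0)) if increasing else (mu(1.0), mu(0.0))
    if x <= low_value:   # x below the range of mu: closest endpoint
        return 0.0 if increasing else 1.0
    if x >= high_value:  # x above the range of mu: closest endpoint
        return 1.0 if increasing else 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # bisection for mu(theta) = x
        mid = (lo + hi) / 2.0
        if (mu(mid) < x) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def wagp_gp_mab(reward_fns, sample_reward, horizon, rng):
    """Greedy policy driven by the weighted global parameter estimate."""
    num_arms = len(reward_fns)
    counts = [0] * num_arms   # N_k(t)
    means = [0.0] * num_arms  # sample mean reward of each arm
    theta_hat = 0.0
    selections = []
    for t in range(1, horizon + 1):
        if t == 1:
            arm = rng.randrange(num_arms)
        else:
            arm = max(range(num_arms), key=lambda k: reward_fns[k](theta_hat))
        reward = sample_reward(arm)
        means[arm] = (counts[arm] * means[arm] + reward) / (counts[arm] + 1)
        counts[arm] += 1
        # Weighted sum of per-arm estimates with weights w_k(t) = N_k(t) / t;
        # arms that were never played have zero weight and are skipped.
        theta_hat = sum(
            (counts[k] / t) * estimate_parameter(reward_fns[k], means[k])
            for k in range(num_arms)
            if counts[k] > 0
        )
        selections.append(arm)
    return selections, theta_hat


# Example reward functions (assumed for illustration).
reward_fns = [
    lambda theta: 1.0 - math.sqrt(theta),
    lambda theta: 0.8 * theta,
    lambda theta: theta ** 2,
]
theta_star = 0.2  # hypothetical true global parameter; arm 0 is optimal here
rng = random.Random(0)


def sample_reward(arm):
    """Bernoulli reward with mean mu_arm(theta_star) (an assumption)."""
    return 1.0 if rng.random() < reward_fns[arm](theta_star) else 0.0


selections, theta_hat = wagp_gp_mab(reward_fns, sample_reward, 3000, rng)
print(selections.count(0), theta_hat)
```

Pooling the per-arm estimates through the weights $w_k(t)$ means that early observations from suboptimal arms still contribute to $\hat{\theta}_t$, while their influence shrinks as the optimal arm accumulates samples.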

4.3 Preliminaries for the regret analysis

In this subsection we define the tools that will be used in deriving the regret
bounds for the WAGP. Consider any arm $k \in \mathcal{K}$. Its optimality region is defined
as $\Theta_k := \{\theta \in \Theta : k \in k^*(\theta)\}$. Clearly, we have $\bigcup_{k \in \mathcal{K}} \Theta_k = \Theta$. If $\Theta_k = \emptyset$
for an arm $k$, this implies that there exists no global parameter value for which
arm $k$ is optimal. Since there exists an arm $k'$ such that $\mu_{k'}(\theta) > \mu_k(\theta)$ for any
$\theta \in \Theta$ for an arm with $\Theta_k = \emptyset$, the greedy policy will discard arm $k$ after $t = 1$.
Therefore, without loss of generality we assume that $\Theta_k \neq \emptyset$ for all $k \in \mathcal{K}$.
For the global parameter $\theta_* \in \Theta$, we define the suboptimality gap of an arm
$k \in \mathcal{K} \setminus k^*(\theta_*)$ as $\delta_k(\theta_*) := \mu^*(\theta_*) - \mu_k(\theta_*)$. For the parameter $\theta_*$, the minimum
suboptimality gap is defined as $\delta_{\min}(\theta_*) := \min_{k \in \mathcal{K} \setminus k^*(\theta_*)} \delta_k(\theta_*)$.


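[Figure: reward functions of Arm 1, Arm 2 and Arm 3, $\mu_1(\theta) = 1 - \sqrt{\theta}$, $\mu_2(\theta) = 0.8\theta$ and $\mu_3(\theta) = \theta^2$, plotted against $\theta$.]

Figure 3: Illustration of minimum suboptimality gap and suboptimality distance.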


Recall that the estimated expected reward of arm $k$ is equal to its expected
reward corresponding to the global parameter estimate. We will show that as
more arms are selected, the global parameter estimate will converge to the true
value of the global parameter. However, if $\theta_*$ lies close to the boundary of the
optimality region of $k^*(\theta_*)$, the global parameter estimate may fall outside of the
optimality region of $k^*(\theta_*)$ for a large number of time steps, thereby resulting
in a large regret. Let $\Theta^{\mathrm{sub}}(\theta_*)$ be the suboptimality region for a given global
parameter $\theta_*$, which is defined as the subset of the parameter space in which an
arm in the set $\mathcal{K} \setminus k^*(\theta_*)$ is optimal, i.e., $\Theta^{\mathrm{sub}}(\theta_*) := \bigcup_{k' \in \mathcal{K} \setminus k^*(\theta_*)} \Theta_{k'}$. In order
to bound the expected number of such deviations from the optimality region,
we define a metric called the suboptimality distance, which is
equal to the smallest distance between the value of the global parameter and
the suboptimality region.

Definition 2. For a given global parameter $\theta_*$, the suboptimality distance is
defined as

$$\Delta_{\min}(\theta_*) := \begin{cases} \inf_{\theta' \in \Theta^{\mathrm{sub}}(\theta_*)} |\theta_* - \theta'| & \text{if } \Theta^{\mathrm{sub}}(\theta_*) \neq \emptyset, \\ 1 & \text{if } \Theta^{\mathrm{sub}}(\theta_*) = \emptyset. \end{cases}$$

From the definition of the suboptimality distance it is evident that the
proposed policy always selects an optimal arm in $k^*(\theta_*)$ when $\hat{\theta}_t$ is within $\Delta_{\min}(\theta_*)$
of the global parameter $\theta_*$. An illustration of the suboptimality gap and
suboptimality distance is given in Fig. 3 for a GP-MAB instance with 3 arms and
reward functions $\mu_1(\theta) = 1 - \sqrt{\theta}$, $\mu_2(\theta) = 0.8\theta$ and $\mu_3(\theta) = \theta^2$, $\theta \in [0,1]$.
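
For the three-arm instance above, the minimum suboptimality gap and the suboptimality distance can be approximated numerically. The sketch below evaluates them on a uniform grid over $[0,1]$; the grid resolution, the choice $\theta_* = 0.2$, and the Hölder constants `D2 = 2.0`, `gamma2 = 0.5` (which hold for these three functions on the unit interval) are assumptions made for illustration.

```python
import math

reward_fns = [
    lambda theta: 1.0 - math.sqrt(theta),  # arm 1
    lambda theta: 0.8 * theta,             # arm 2
    lambda theta: theta ** 2,              # arm 3
]


def optimal_arms(theta, tol=1e-12):
    """Set of arms whose expected reward is maximal at theta."""
    values = [mu(theta) for mu in reward_fns]
    best = max(values)
    return {k for k, value in enumerate(values) if value >= best - tol}


def minimum_suboptimality_gap(theta_star):
    """Smallest gap between the best reward and a suboptimal arm's reward."""
    values = [mu(theta_star) for mu in reward_fns]
    best_arms = optimal_arms(theta_star)
    return min(
        max(values) - values[k] for k in range(len(values)) if k not in best_arms
    )


def suboptimality_distance(theta_star, grid_size=10000):
    """Grid approximation of the distance from theta_star to the suboptimality region."""
    best_arms = optimal_arms(theta_star)
    distances = [
        abs(theta_star - i / grid_size)
        for i in range(grid_size + 1)
        if optimal_arms(i / grid_size) - best_arms  # some other arm is optimal
    ]
    return min(distances) if distances else 1.0


theta_star = 0.2
gap = minimum_suboptimality_gap(theta_star)
distance = suboptimality_distance(theta_star)

# Assumed Hoelder constants for the reward functions above.
D2, gamma2 = 2.0, 0.5
gap_based_lower_bound = (gap / (2.0 * D2)) ** (1.0 / gamma2)
print(gap, distance, gap_based_lower_bound)
```

With these choices arm 1 is optimal at $\theta_* = 0.2$, the gap is about 0.39 and the distance to the region where arm 2 takes over is about 0.23. The last quantity, $(\delta_{\min}(\theta_*)/2D_2)^{1/\gamma_2}$, is smaller than the computed distance, as one would expect from a gap-based lower bound on the suboptimality distance.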

In the following lemma, we show that the minimum suboptimality distance is
nonzero for any global parameter $\theta_*$. This result ensures that we can identify
the optimal arm within a finite amount of time.

Lemma 1. Given any $\theta_* \in \Theta$, there exists a constant $\epsilon_{\theta_*} = (\delta_{\min}(\theta_*)/2D_2)^{1/\gamma_2}$,
where $D_2$ and $\gamma_2$ are the constants given in Assumption 2, such that $\Delta_{\min}(\theta_*) \geq
\epsilon_{\theta_*}$. In other words, the minimum suboptimality distance is bounded below by a
positive number.

For notational brevity, in this section we use $\Delta_* := \Delta_{\min}(\theta_*)$ and $\delta_* := \delta_{\min}(\theta_*)$.

Lemma 2. Consider a run of WAGP until time $t$. Then, the following relation
between $\hat{\theta}_t$ and $\theta_*$ holds with probability one:
$|\hat{\theta}_t - \theta_*| \leq \sum_{k=1}^{K} w_k(t) D_1 |\hat{X}_{k,t} - \mu_k(\theta_*)|^{\gamma_1}$.

Lemma 2 shows that the gap between the global parameter estimate and the
true value of the global parameter is bounded by a weighted sum of the gaps
between the estimated expected rewards and the true expected rewards of the
arms.

Lemma 3. For a given global parameter $\theta_*$, the one-step regret of the greedy
policy is bounded by $r_t(\theta_*) = \mu^*(\theta_*) - \mu_{I_t}(\theta_*) \leq 2D_2 |\theta_* - \hat{\theta}_{t-1}|^{\gamma_2}$ with probability
one, where $I_t$ is the arm selected by the greedy policy at time $t \geq 2$.

Lemma 3 ensures that the one-step loss decreases as $\hat{\theta}_t$ approaches $\theta_*$.
Since the regret at time $T$ is the sum of the one-step losses up to time $T$, we
will bound the regret by bounding the expected distance between $\hat{\theta}_t$ and $\theta_*$.

Given a parameter value $\theta_*$, let $\mathcal{G}_{\theta_*,\hat{\theta}_t}(x) := \{|\theta_* - \hat{\theta}_t| > x\}$ be the event
that the distance between the global parameter estimate and its true value
exceeds $x$. Similarly, let $\mathcal{F}^{k}_{\theta_*,\hat{\theta}_t}(x) := \{|\hat{X}_{k,t} - \mu_k(\theta_*)| > x\}$ be the event that
the distance between the sample mean reward estimate of arm $k$ and the true
expected reward of arm $k$ exceeds $x$. The following lemma relates these events.

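Lemma 4. For any $t \geq 2$ and given global parameter $\theta_*$, we have

$$\mathcal{G}_{\theta_*,\hat{\theta}_t}(x) \subseteq \bigcup_{k=1}^{K} \mathcal{F}^{k}_{\theta_*,\hat{\theta}_t}\left( \left( \frac{x}{w_k(t) D_1 K} \right)^{1/\gamma_1} \right)$$

with probability one.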

This lemma follows from the decomposition given in Lemma 2. This lemma
will be used to bound the probability of event $\mathcal{G}_{\theta_*,\hat{\theta}_t}(x)$ in terms of probabilities
of the events $\mathcal{F}^{k}_{\theta_*,\hat{\theta}_t}(x)$.


4.4 Problem Independent (Worst-case) regret bounds for WAGP


In a subsequent section we show that the problem specific regret is bounded. The
bound depends on $\theta_*$ through $\Delta_*$; hence, it may blow up as $\Delta_* \to 0$ (i.e.,
$1/\Delta_* \to \infty$). Thus, we go on to establish worst-case regret bounds. Along the
way we show that “complete learning” occurs, in the sense defined below.


Definition 3. We say that a policy achieves complete learning if the estimate
$\hat{\theta}_t$ of the parameter converges in probability to its true value $\theta_*$.


It is shown in (Mersereau et al. 2009) that the linear bandit model achieves
complete learning under an assumption on the slope of the reward function.
The next theorem proves the convergence of $\hat{\theta}_t$ under Assumption 2.

Theorem 2. Under Assumption 2, the global parameter estimate of the WAGP
converges to the true value of the global parameter in the mean-squared sense, i.e.,
$\lim_{t \to \infty} E[|\hat{\theta}_t - \theta_*|^2] = 0$.

Theorem 2 shows that even though the greedy policy obtains non-linear and
noisy observations of the global parameter, the estimator converges to the true
value of the global parameter in the mean-squared sense.


Corollary 1. Under Assumption^ the WAGP achieves complete learning. 

The following theorem bounds the expected one-step regret of the WAGP.


Theorem 3. Under Assumption^ for given global parameter 0*, the expected 
one-step regret of the greedy policy is bounded by $E[r_t(\theta_*)] = O(t^{-\gamma_1\gamma_2/2})$.

Theorem 3 proves that the expected loss incurred in one step by the greedy
policy goes to zero with time, and also bounds the expected loss that will be
incurred at any time step $t$. This is a worst-case bound in the sense that it
does not depend on $\theta_*$. Using this result, we derive the worst-case regret bound
in the next theorem.

Theorem 4. Under Assumption^ the worst-case regret of WAGP is 

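$\sup_{\theta_* \in \Theta} E[\mathrm{Reg}(\theta_*, T)] = O\left(K^{\gamma_1\gamma_2/2}\, T^{1-\gamma_1\gamma_2/2}\right)$.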


Note that the worst-case regret bound is sub-linear both in the time
horizon $T$ and in the number of arms $K$. Moreover, it depends on the form of the
reward functions given in Assumption 1. The Hölder exponent $\gamma_1$ on the inverse
reward functions characterizes the informativeness of an arm about the other
arms. The informativeness of an arm $k$ can be viewed as the information obtained
about the expected rewards of the other arms from the rewards observed
from arm $k$. The informativeness is maximized when the inverse
reward functions are linear or piecewise linear, i.e., $\gamma_1 = 1$. It is increasing
in $\gamma_1$, so the regret decreases as the informativeness grows. On
the other hand, the Hölder exponent $\gamma_2$ is related to the loss due to suboptimal
arm selections, which decreases with $\gamma_2$. Both of these observations follow from
Lemmas 2 and 3. As a consequence, the worst-case regret is decreasing in both
$\gamma_1$ and $\gamma_2$.

When the reward functions are linear or piecewise linear, we have $\gamma_1 = \gamma_2 = 1$,
which is an extreme case of our model; hence, the worst-case regret is $O(\sqrt{T})$,
which matches the worst-case regret bound of standard MAB algorithms
in which a linear estimator is used (Bubeck and Cesa-Bianchi 2012) and the
bounds obtained for the linearly parametrized bandits (Mersereau et al. 2009).
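
The dependence of the worst-case rate on the two Hölder exponents can be made concrete with a small helper. This is a sketch under the assumption, suggested by the proofs in the appendix, that the worst-case regret grows as $T^{1-\gamma_1\gamma_2/2}$; the function name `worst_case_exponent` is hypothetical and constants are ignored.

```python
def worst_case_exponent(gamma1, gamma2):
    """Exponent of T in the assumed worst-case rate T^(1 - gamma1*gamma2/2).

    gamma1: Hoelder exponent of the inverse reward functions (informativeness).
    gamma2: Hoelder exponent of the reward functions.
    """
    if not (0.0 < gamma1 <= 1.0 and 0.0 < gamma2 <= 1.0):
        raise ValueError("both exponents are assumed to lie in (0, 1]")
    return 1.0 - gamma1 * gamma2 / 2.0


if __name__ == "__main__":
    # Fully informative (linear) case versus a less informative case.
    print(worst_case_exponent(1.0, 1.0))
    print(worst_case_exponent(0.5, 1.0))
```

With $\gamma_1 = \gamma_2 = 1$ the exponent is $1/2$, the $\sqrt{T}$ case discussed above; any smaller exponent pushes the rate toward linear growth.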


4.5 Problem specific regret bounds for the WAGP


Although the regret bound derived in the previous section holds for any global 
parameter value, the performance of the greedy policy depends on the true 
value of the global parameter. For example, it is easier to identify the optimal 
arm in a GP-MAB with a large suboptimality distance than a GP-MAB with 
a small suboptimality distance. In this section, we prove a regret bound that 
depends on the suboptimality distance. Our regret bound is characterized by 
three regimes of growth: first a regime of sub-linear growth of regret, then a 
regime of logarithmic growth, then a regime of convergent growth of regret. 

The boundaries of these regimes are defined by problem-specific constants, 
as discussed below. 


Definition 4. Let C'i(A*) be the least integer t such that r > 

2 

let C' 2 (A=^) be the least integer r such that r > log(T). 

A* 


2 

^ log(r) and 
2A* '>'1 


^The asymptotic notation is only used for a succinct representation, to hide the constants 
and highlight the time dependence. This bound holds not just asymptotically but for any 
finite t. 

^Informativeness is a measure of the strength of correlation between arm rewards.

The constants $C_1(\Delta_*)$ and $C_2(\Delta_*)$ depend on the informativeness (Hölder
exponent $\gamma_1$) and the global parameter $\theta_*$. We define the expected regret between
time $T_1$ and $T_2$ for global parameter $\theta_*$ as

$R_{\theta_*}(T_1, T_2) := \sum_{t=T_1}^{T_2} E[r_t(\theta_*)]$.

The following theorem gives a three regime problem specific regret bound. 

Theorem 5. Under Assumptions^ the regret of the WAGP is as follows: 

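(i) For $1 \le T \le C_1(\Delta_*)$, the growth of regret is sub-linear in time, i.e.,

$\mathrm{Reg}(\theta_*, T) \le S_1 + S_2 T^{1-\gamma_1\gamma_2/2}$,

(ii) For $C_1(\Delta_*) \le T \le C_2(\Delta_*)$, the growth of regret is logarithmic in time, i.e.,

$R_{\theta_*}(C_1(\Delta_*), T) \le 1 + 2K \log\left(\frac{T}{C_1(\Delta_*)}\right)$,

(iii) For $T \ge C_2(\Delta_*)$, the growth of regret is bounded, i.e.,

$R_{\theta_*}(C_2(\Delta_*), T) \le K\frac{\pi^2}{3}$,

where $S_1$ and $S_2$ are constants independent of the global parameter $\theta_*$ and given in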
Appendix |y.5| 

Corollary 2. The regret of the WAGP is bounded, i.e., $\lim_{T \to \infty} \mathrm{Reg}(\theta_*, T) < \infty$.


At each time $t \le T$ in each regime of Theorem 5, the probability of selecting
a suboptimal arm is bounded by a different function of $t$, which leads to different
growth rates of the regret bound depending on the value of $T$. For instance,
when $C_1(\Delta_*) \le t \le C_2(\Delta_*)$, the probability of selecting a suboptimal arm is of
the order of $t^{-1}$; hence, the greedy policy achieves logarithmic regret. When
$t \ge C_2(\Delta_*)$, the probability of selecting a suboptimal arm is of the order of $t^{-2}$,
which makes the probability of selecting a suboptimal arm infinitely often go to
zero. In conclusion, the greedy policy achieves bounded regret.
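
The difference between the logarithmic and the bounded regime comes down to two elementary sums. The sketch below, with a hypothetical helper `partial_sum`, compares partial sums of $t^{-1}$ (which keep growing like $\log T$) with partial sums of $t^{-2}$ (which stay below $\pi^2/6$). It assumes one-step error probabilities of exactly these orders and ignores the constants in the bound.

```python
import math


def partial_sum(power, start, end):
    """Return sum_{t=start}^{end} t^(-power) for integers start <= end."""
    return sum(t ** (-power) for t in range(start, end + 1))


if __name__ == "__main__":
    for horizon in (10 ** 3, 10 ** 4, 10 ** 5):
        logarithmic = partial_sum(1, 1, horizon)   # grows without bound
        bounded = partial_sum(2, 1, horizon)       # converges to pi^2 / 6
        print(horizon, logarithmic, bounded, math.pi ** 2 / 6)
```

Extending the horizon tenfold adds roughly $\log 10$ to the first sum and almost nothing to the second, which mirrors the transition from regime (ii) to regime (iii).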


Theorem 6. The sequence of arms selected by the WAGP converges to the
optimal arm almost surely, i.e., $\lim_{t \to \infty} I_t = k^*(\theta_*)$ with probability 1.


Theorem 6 implies that a suboptimal arm is selected by the greedy policy
only finitely many times. In other words, there exists a finite time after which
the greedy policy selects the optimal arm with probability 1. This is the biggest
difference between the proposed greedy policy and standard MAB
algorithms (Lai and Robbins 1985; Auer et al. 2002; Russo and Van Roy 2015),
in which suboptimal arms are selected infinitely many times.

Although the problem specific regret bound is finite, since $\lim_{\Delta_* \to 0} C_1(\Delta_*) = \infty$,
in the worst case this bound reduces to the worst-case regret bound given
in Theorem 4. Summarizing: the proposed policy differs from standard MAB
policies in that (i) it achieves bounded regret (Theorem 5) and (ii) it learns the
optimal arm in finite time (Theorem 6).

4.6 Bayesian risk analysis of the WAGP 

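In this section, assuming that the global parameter is drawn from an unknown
distribution $f$ on $\Theta$, we analyze the Bayesian risk, which is defined as follows:

$\mathrm{Risk}(T) := E_{\theta \sim f(\theta)}\left[ E_{X \sim \nu(\theta)}\left[ \sum_{t=1}^{T} r_t(\theta) \,\middle|\, \theta_* = \theta \right] \right]$,

where $\nu(\theta) = \times_{k=1}^{K} \nu_k(\theta)$ is the joint distribution of the rewards given $\theta$. The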
Bayesian risk is equal to the expected regret with respect to the distribution
of the global parameter. Since the suboptimality distance is a function of the
global parameter $\theta_*$, there is a prior distribution on the minimum suboptimality
distance, which we denote as $g(\Delta_*)$. A simple upper bound on the Bayesian risk
can be obtained by taking the expectation of the regret bound given in Theorem 4
with respect to $\theta_*$, which gives the bound $\mathrm{Risk}(T) = O(T^{1-\gamma_1\gamma_2/2})$. A tighter
bound on the Bayesian risk can be derived if the following assumption
holds.


Assumption 3. The prior distribution on the global parameter is such that the
minimum suboptimality distance $\Delta_*$ has a bounded density function, i.e., $g(\Delta_*) \le B$.


Theorem 7. Under Assumptions\^ auc?[^ the Bayesian risk of the WAGP is 
bounded by

(i) $\mathrm{Risk}(T) = O(\log T)$, for $\gamma_1\gamma_2 = 1$.

(ii) $\mathrm{Risk}(T) = O(T^{1-\gamma_1\gamma_2})$, for $\gamma_1\gamma_2 < 1$.


The Bayesian risk bound obtained for the WAGP coincides with the Bayesian
risk bound for the linearly-parametrized MAB given in (Mersereau et al. 2009)
when the arms are fully informative ($\gamma_1\gamma_2 = 1$). For this case, the optimality of
the Bayesian risk bound is established in (Mersereau et al. 2009), in which a
lower bound of $\Omega(\log T)$ is proven. Similar to the worst-case regret bound given
in Theorem 4, the Bayesian risk is also decreasing with the informativeness, and
it is minimized when the arms are fully informative.
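
The split between a logarithmic and a polynomial Bayesian risk can be traced to an elementary sum. As a sketch, assume that the expected one-step loss at time $t$ is of order $t^{-s}$ with $s = \gamma_1\gamma_2 \in (0, 1]$ and ignore constants (the symbol $s$ is introduced here only for illustration). Comparing the sum with an integral gives:

```latex
\sum_{t=1}^{T} t^{-s}
  \;\le\; 1 + \int_{1}^{T} x^{-s}\,dx
  \;=\;
  \begin{cases}
    1 + \log T, & s = 1,\\[4pt]
    1 + \dfrac{T^{1-s} - 1}{1 - s}, & 0 < s < 1.
  \end{cases}
```

The first case grows like $\log T$ and the second like $T^{1-s}$, which is the shape of the two regimes in the Bayesian risk bound.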


4.7 Lower bounds on the regret 

Theorem 4 shows that the worst-case regret bound is $O(T^{1-\gamma_1\gamma_2/2})$, which
implies that the regret is decreasing with the Hölder exponents $\gamma_1$ and $\gamma_2$. In this
section, we show that this is the best attainable regret order for the family of
policies that use a global estimator $\hat{\theta}_t$. Essentially, these policies use the reward
observations from the arms to estimate a global parameter $\hat{\theta}_t$, and then use
the estimated global parameter together with the expected reward functions to
estimate the arms' expected rewards.

Theorem 8. For the family of policies that use a global estimator Ot and T > 8, 
there exists 0* € [0, 1] for which Fi[Reg{0^,,T)] > where B = (1 — 

exp(—16)). 

Theorem shows that by choosing unfavorably small values for A* (for 
instance, by lettirrg A* = as in the proof of Theorem^, the regret 

can be made to grow as t ^ before stabilizing to the finite value. Therefore, 
there exists a problem instance for which the worst-case regret bound of WAGP 
matches the lower bound (in terms of the time order). 

5 Non-stationary global parameter 

In this section, we consider the case when the global parameter changes over
time. We denote the global parameter at time $t$ as $\theta_*^t$. The reward of arm $k$
at time $t$, i.e., $X_{k,t}$, is drawn independently from the distribution $\nu_k(\theta_*^t)$, where
$E[X_{k,t}] = \mu_k(\theta_*^t)$. In the dynamic model, we assume that only the parameter
changes over time, while the reward functions remain the same. In order to
bound the regret, we impose a restriction on the speed of change of the global
parameter, which is formalized in the next assumption.


Figure 4: Rounding structure for time windowed WAGP. (Timeline over time $t$ showing the passive and active sub-rounds of rounds $p-1$ and $p$, with marks at $2(p-1)\tau_h$, $(2p-1)\tau_h$ and $2p\tau_h$.)


Assumption 4. For any $t$ and $t'$, we have

$|\theta_*^t - \theta_*^{t'}| \le L \left| \frac{t}{\tau} - \frac{t'}{\tau} \right|^{\alpha}$,

where $\tau > 0$ is the stability of drift, and $\alpha > 0$ is the exponent of the drift.

The WAGP needs to be modified to handle a non-stationary global parameter,
since the optimal arm $k^*(\theta_*^t)$ may change over time. To do this, the
modified WAGP uses only a recent window of past reward observations from
the arms when estimating the global parameter. By choosing the window length
appropriately, we can balance the regret due to the variation of the global
parameter over time, as constrained by Assumption 4, against the sample size within the window.
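
The trade-off behind the window length can be sketched numerically. The code below assumes, with all constants set to one, that the tracking error within a window is the sum of an estimation term $\tau_h^{-s/2}$ that shrinks with more samples and a drift term $(2\tau_h/\tau)^{\alpha s}$ that grows with the window, where $s$ stands for $\gamma_1\gamma_2$. These two terms follow the structure of the proof of Theorem 9; the function names and the brute-force search are illustrative assumptions, not the paper's closed-form choice.

```python
def tracking_error(half_window, tau, alpha, s):
    """Assumed error proxy for half window length `half_window`.

    estimation: fewer samples in a short window give a noisier estimate.
    drift: a long window mixes observations from a parameter that has moved.
    """
    estimation = half_window ** (-s / 2.0)
    drift = (2.0 * half_window / tau) ** (alpha * s)
    return estimation + drift


def best_half_window(tau, alpha, s):
    """Integer half window length in [1, tau] minimising the error proxy."""
    return min(range(1, int(tau) + 1),
               key=lambda h: tracking_error(h, tau, alpha, s))


if __name__ == "__main__":
    for tau in (100, 1000, 10000):
        h = best_half_window(tau, alpha=1.0, s=1.0)
        print(tau, h, tracking_error(h, tau, 1.0, 1.0))
```

Under these assumptions a more stable parameter (larger $\tau$) supports a longer window and a smaller error, in line with the qualitative claim that tracking improves when the drift is slow.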

The modified algorithm groups the time slots into rounds $p = 1, 2, \ldots$, each
having a fixed length of $2\tau_h$, where $\tau_h$ is called the half window length. The key
point in the modified algorithm is to keep separate counters for each round and
to estimate the global parameter in a round based only on observations that are
made within the window of that round. Each round $p$ is further
divided into two sub-rounds. The first is called the passive sub-round,
while the second is called the active sub-round. The first round, $p = 1$, is
an exception: it is both an active and a passive sub-round.

Let WAGP$_p$ be the running instance of the modified WAGP at round $p$.
The action taken at time $t$ is based on WAGP$_p$ if time $t$ is in the active
sub-round of round $p$. As a result of this action, the sample mean rewards and counters
are updated. Let $N_{k,p}(t)$ and $\hat{X}_{k,p,t}$ be the number of times arm $k$ is chosen and
the sample mean reward estimate of arm $k$ in round $p$ at time $t$, respectively. At the beginning
of each round $p$, the estimates and counters of that round are equal to zero, i.e.,
$N_{k,p}(2\tau_h p + 1) = 0$ and $\hat{X}_{k,p,2\tau_h p + 1} = 0$. However, due to the two sub-round
structure, when the active sub-round starts, the learner already has observations
from the passive sub-round.

In the static global parameter model, we were able to bound the problem
specific regret by a finite constant (independent of the time horizon $T$)
and the parameter-independent regret by a function that is sub-linear in time ($T^{\gamma}$ for an exponent $\gamma > 0$).
However, when the global parameter is changing, it is not possible to give
sub-linear or finite regret bounds. Therefore, we focus on the average regret, which
is given as

$\mathrm{Reg}^{ave}(T) := \frac{1}{T} \left( \sum_{t=1}^{T} \mu^*(\theta_*^t) - \sum_{t=1}^{T} \mu_{I_t}(\theta_*^t) \right)$.

The next theorem bounds the average regret in terms of the stability
and exponent of the drift.

Theorem 9. Under Assumptions\^and\^ when the half window length of the 
time windowed WAGP is set to Th = r(“'' 2 +o- 5 ), the average regret is 

( -“T-iT-i 

Reg‘^'‘%T) = O ( T( 2 “-r 2 +i) 

Theorem 9 shows that the average regret is bounded by a decreasing function
of the stability of drift and the informativeness. This is expected, since the greedy
policy is able to track the changes in the parameter when the drift is slow.
Since the informativeness of the arms is directly related to the learning rate
of the global parameter, the tracking performance of the modified algorithm
improves with the informativeness.


6 Conclusion 

In this paper we introduce a new sequential decision making model called global
bandits. Global bandits model problems in which the learner can reduce the
uncertainty about the expected arm rewards by exploiting the correlation between
the expected rewards of the arms. Under mild assumptions, we design a
learning algorithm that achieves bounded regret. Within this class of models
we focus on the globally informative multi-armed bandit (GI-MAB) and the globally
parameterized multi-armed bandit (GP-MAB), which encompasses the previously
introduced linearly-parametrized bandits as a special case. For GI-MAB we
show that regret is bounded if there exists sufficient correlation between the
expected arm rewards. For GP-MAB, in addition to the bounded regret, we also
show that the worst-case regret is sub-linear in time, that the time order of the
worst-case regret depends on the informativeness of the arms, and that the Bayesian
risk reduces to the Bayesian risk of linearly-parametrized bandits when the
arms are fully informative.


7 Appendices 

7.1 Preliminaries 

In all the proofs given below let $w(t) := (w_1(t), \ldots, w_K(t))$ be the vector of
weights and $N(t) := (N_1(t), \ldots, N_K(t))$ be the vector of counters at time $t$. We
have $w(t) = N(t)/t$. Since $N(t)$ depends on the history, they are both random
variables that depend on the sequence of obtained rewards.

7.2 Proof of Lemma 1 

Consider a parameter value d € 0. For any suboptimal arm k € K. — k*(9), we 
have Aifc.(e)(6') - p.k{d) > 6,nini0) > 0. We also know that /Xfc(6»') > p.k>{ 9 ){d') 
for all 9' € 0fe. Hence for any 9' € 0^ at least one of the following must 
hold: (i) p.k{0') > Aifc(^) - d^in{0)/2, (ii) < k‘k*( 9 ){ 0 ) + '5min(6')/2. 

If both of the above does not hold, then we must have p.k{d') < p,k*(e){(^')^ 
which is false. This implies that we either have HkiO) — f^kiO') < dmin(^*)/2 or 
fkk* ( 9 )id) — fkk* ( 9 ){d') > —dmin(^)/2, Or both. Recall that from Assumption I we 
have \9 — 9'\ > \nk{0) — pLk{9')\^/'^'^/. This implies that \9 — 9'\ > eg for all 
9' G 0fc. 
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7.3 Proof of Lemma 2 

Assumption 2 ensures that the reward functions are either monotonically in¬ 
creasing or decreasing. We generate imaginary functions that are = /ife(6*) 

for 0 G 0 and for y, y' G [0,1], 

\'^k\y)-^k\y)\<Di\y-y'r (3) 

We have also > 1 when y > maxg^Q Hkid) and < 0 when y < 

minegeMfe(6')-Then, 


K 

\0* - ^t\ = \ '^Wkit)9k,t - 6**1 

k^l 
K 

= ^Wk{t) 
k^l 
K 

k^l 

K 

- '^'^k{t)Di\Xk,t - (4) 

k^l 

where we need to look at following two cases for the first inequality. The first 
case is Xk,t € where the statement immediately follows. The second case is 
Xk,t ^ yk, where the global parameter estimator 9k,t is either 0 or 1. 

7.4 Proof of Lemma 3 

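Note that $I_t \in \arg\max_{k \in \mathcal{K}} \mu_k(\hat{\theta}_t)$. Therefore, we have

$\mu_{I_t}(\hat{\theta}_t) - \mu_{k^*(\theta_*)}(\hat{\theta}_t) \ge 0$. (5)

Since $\mu^*(\theta_*) = \mu_{k^*(\theta_*)}(\theta_*)$, we have

$\mu^*(\theta_*) - \mu_{I_t}(\theta_*) = \mu_{k^*(\theta_*)}(\theta_*) - \mu_{I_t}(\theta_*)$

$\le \mu_{k^*(\theta_*)}(\theta_*) - \mu_{I_t}(\theta_*) + \mu_{I_t}(\hat{\theta}_t) - \mu_{k^*(\theta_*)}(\hat{\theta}_t)$

$= \mu_{k^*(\theta_*)}(\theta_*) - \mu_{k^*(\theta_*)}(\hat{\theta}_t) + \mu_{I_t}(\hat{\theta}_t) - \mu_{I_t}(\theta_*)$

$\le 2 D_2 |\theta_* - \hat{\theta}_t|^{\gamma_2}$,

where the first inequality follows from (5) and the second inequality follows from
Assumption 1.

7.5 Proof of Lemma 4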


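$\{|\theta_* - \hat{\theta}_t| > x\} \subseteq \left\{ \sum_{k=1}^{K} w_k(t) D_1 |\hat{X}_{k,t} - \mu_k(\theta_*)|^{\gamma_1} > x \right\}$

$\subseteq \bigcup_{k=1}^{K} \left\{ w_k(t) D_1 |\hat{X}_{k,t} - \mu_k(\theta_*)|^{\gamma_1} > w_k(t) x \right\}$

$= \bigcup_{k=1}^{K} \left\{ |\hat{X}_{k,t} - \mu_k(\theta_*)| > \left( \frac{x}{D_1} \right)^{1/\gamma_1} \right\}$,

where the first inclusion follows from Lemma 2 and the second inclusion is due to
the fact that $\sum_{k=1}^{K} w_k(t) = 1$.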

7.6 Proof of Proposition 1 

Let Ui := and Uj := fkj{0*) for arbitrary i,jGX where Ui ^ Uj. Then, 

We have 

< D2 

< D2D'1^\u-u'P^^\ 

Also, since fij{uj) = = Hi{0*) = Ui, we have 

I 0 < \Ui aij I. 

7.7 Proof of Theorem 1 

Observe that if all the arm rewards are estimated accurately within the range 
6/2, then an optimal arm will be selected with probability 1. Let £k be the event 
that \uk~Uk\ ^ <5niin/2- Then, the probability of choosing a suboptimal arm can 
be bounded by the union of events £k, i.e., Pr (/4 ^ k*) < Pr(U^;^5fc), which 
can be further bounded using the union bound, i.e., Pr(/t ^ k*) < J2k=i Pr('^fc)- 
Since the loss due to a suboptimal arm selection is bounded by 1, we have 

T K 

Reg(M, T) < EE Pr 

t=i fc=i 

Let Wk{t) = Nk{t)/t. Using the definition of u^., we have 

T K (( ^ 

Reg(M, T) < EE Pr I I ^ Wj{t)\uk- fk,j{uj)\ 

t=i k=i \ \j=i,j=jtk 

T K { 

<EEPM U {Wk- fk,j{Uj)\>^\[j^\uk-Uk\>^ 

t=l k=l \j=l,j^k ^ 



+ Wk(t)\uk -Uk\> - 
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Since 


I'^/c fki’^j) fk,j{Uj)\ 

< e + D\uj — Ujl'*, 

where inequality follows from the fact that the estimator functions are e-informative 
and Holder continuous, we have, 

Reg(i6,T) < ^ - 0 ■ 

i — 1 k — 1 \k — 1 / 

Let Se = 5/2 — e and 6c = mm{{Se/D)^^ 5/2}. Then, 

T K K 

Reg(M,T) < 

t=i k=i j=i 

^EEE 2exp{-2S'^Nj{t)) 

k^l j^l 
T 

<Y,2K^exv{-25lt/K) < oo, 

t=i 

where the second inequality follows from the Chernoff-Hoeffding bound and the 
third inequality follows from the worst-case selection process: Nkit) = t/K for 
all k G 1C. Therefore, given 6^ = S/2 — e > 0, the WAGP achieves bounded 
regret. 


7.8 Proof of Theorem 2 

Using Lemma the mean-squared error can be bounded as 

' K 


E 


\0* — Ot 


< E 




\k=l 
K 

k=l 


271 


where the inequality follows from the fact that ^k < K X)Li 

ak > 0. Then, 


r K 


9t\^ 

< KDl& 

^uifc(t)E \Xk,t - fkk{0*)\'^'^^\w{t) 

k—1 



< kdIe 

^ nOO 

Y.wlit) PT{\Xk,t-tkk{e.)f'^^ 

_k=i •'^=0 

> x)dx 


where the second inequality follows from the fundamental theorem of expecta¬ 
tion. Then, we can bound inner expectation as 

poo poo ^ 

/ Pr{\Xk,t - > x)dx < / 2exp{-x^Nk{t))dx. 

Jx—0 Jx—0 

= 27ir(7i)iV,-^ni), 

where N^it) is a random variable and r(-) is Gamma function. Then, we have 


E[|0*-dtp] < 2KjDlTiji)E 


K 




.fc=l 




< 2KjDlT{'ri)t-^\ 

where the last inequality follows from the fact that 


for any Nk{t) since Y.k=i ^k{t) = t. 


7.9 Proof of Theorem 3 

By Lemma 1^ and Jensen’s inequality, we have 

-v 1 72 

E[rtie,)]<2D2E\\e,-0t\ 

Also by Lemma and Jensen’s inequality, we have 


E 


\d* — Ot 




K 


'^Wk{t)E \Xk,t - ^kk{6*)\\w{t) 


7i 


U=i 


( 6 ) 


(7) 


where E[-|-] denotes the conditional expectation. Note that Xk^t = J2xeXk t 

and E 3 ,^i,^(e^)[a;] =/ifc(0*). Therefore, we can bound E[|Afc_t —|it>(t)] for 
each k G X using the Chernoff-Hoeffding bound. For each k G X, we have 

pi 


E 


\Xk,t - k'ki0*)\ 


Pr ( \Xk,t - tkk{d*)\ > x\w 


J X—0 

pOO 

< / 2exp{—2x^Nk{t))dx 
Jx=0 


dx 


< 


2Nk{ty 


( 8 ) 


where Nk (t) = twk (t) is a random variable and the first inequality is a result of 
the Chernoff-Hoeffding bound. Combining Q and Q, we get 


E[|0*-0i|]<2JJi(|)^^E 
^ t 2 


K 




U-1 


(9) 


Since Wk{t) < 1 for all k G X, and y2k=i'^kit) = 1 for any possible w{t), we 
have . Then, combining W and M, we have 

E[r*(0*)] < 20^021"^ 

^ t 2 
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7.10 Proof of Theorem 4 


This bound is consequence of Theorem [3| and the inequality given in [Chlebusl 
([200^, i.e., 


E[Reg(0*,r)] < 1 


2Dj^D2f-^K^ 


1 _ 7172 
2 


(1 




7.11 Proof of Theorem 5 


We need to bound the probability of the event that It k*{9^,). Since at time 
t, the arm with the highest Hk{0t) is selected by the greedy policy, 9t should lie 
in 0 \ Qk*{e,) for greedy policy to select a suboptimal arm. Therefore, we can 
write, 


{It ^ k*{9,)} = {9t&e\ &kH9,)} c 
By Lemma 1^ and (10), we have 


( 10 ) 


K 


Pr(/i^r( 07 )<^E 

k^l 

K 

k^l 

K 

<E2E 


E 


I 



\N{t) 


Pr r, 


e„9t 






exp (-2 Nk{t) 


<2Kexp -2 ('M” fo, 


( 11 ) 


where the first inequality follows from a union bound and the second inequality is
obtained by using the Chernoff-Hoeffding bound. The last inequality is obtained
by using the worst-case selection process $N_k(t) = t/K$. We have
$\Pr(I_t \ne k^*(\theta_*)) \le 1/t$ for $t > C_1(\Delta_*)$ and $\Pr(I_t \ne k^*(\theta_*)) \le 1/t^2$ for $t > C_2(\Delta_*)$. The
bound in the first regime is the result of Theorem 4. The bounds in the second
and third regimes are obtained by summing the probability given in (11) from
$C_1(\Delta_*)$ to $T$ and from $C_2(\Delta_*)$ to $T$, respectively.


7.12 Proof of Theorem 6 

Let $(\Omega, \mathcal{F}, P)$ denote the probability space, where $\Omega$ is the sample set and $\mathcal{F}$ is the
$\sigma$-algebra that the probability measure $P$ is defined on. Let $\omega \in \Omega$ denote a sample
path. We will prove that there exists an event $N \in \mathcal{F}$ such that $P(N) = 0$ and
if $\omega \in N^c$, then $\lim_{t \to \infty} I_t(\omega) = k^*(\theta_*)$. Define the event $\mathcal{E}_t := \{I_t \ne k^*(\theta_*)\}$.
We show in the proof of Theorem 5 that $\sum_{t=1}^{\infty} P(\mathcal{E}_t) < \infty$. By the Borel-Cantelli
lemma, we have

$\Pr(\mathcal{E}_t \text{ infinitely often}) = \Pr\left(\limsup_{t \to \infty} \mathcal{E}_t\right) = 0$.

Define $N := \limsup_{t \to \infty} \mathcal{E}_t$, where $\Pr(N) = 0$. We have

$N^c = \liminf_{t \to \infty} \mathcal{E}_t^c$,

where $\Pr(N^c) = 1 - \Pr(N) = 1$, which means that $I_t = k^*(\theta_*)$ for all but a
finite number of $t$.


7.13 Proof of Theorem 7 

The one step loss due to suboptimal arm selection with global parameter esti¬ 
mate 9t is given in Lemma Recall that we have 

{It^k*{9,)}C{\9,-9t\> 


Let Yg ■= \9^ — 9t\. Then, we have 


T 


Risk(T) < 2 L >2 ^ E 6 '*~/( 6 ») 

Yu{e,) 


T 

Yuie.) 

* 

<1 

A 

1—1 


t=i 


where I(.) is the indicator function which is 1 if the statement is true and zero 
otherwise. The first inequality follows from Lemma The second inequality 
follows from Jensen’s inequality and the fact that !(•) = F(-) for any 7 > 0. We 
now focus on the expectation expression for some arbitrary t. Let f{9) denote 
the density function of the global parameter. 




E, 


■(«.) 




p1 poo 

/ / Pr(yg g I(yg g > A*) > x)dx<19 

nl pOO 

/ f{0*) Pr{Yg^g^>x)dxd9 

J 9^ =0 J x—A* 

P 1 POO 

/ 5(A)/ Pr(yg^ > a;) da; dA, 

«/ A=0 x—A ’ 


where the last equation follows from a change of variables in the integral. Note 
that we have by Theorem 


Pr(Fg^ g^>x)< 2K exp 


2 a;^i 



Then, we have 

Egr^l^g Exr^v > A*) 

<2KB j exp(^-2A^D~'^ dA J exp (^-2y^ D~^ dy 

= 2KB {^2-"-^DiK"-^T 

where the inequality follows from the change of variable y = x — A and the fact 
_2_ ^ — 

that {y + A) f I > since 2/71 > 1. Performing a summation from 1 

to T, we get 

f 1 + 7 l(l + 21 ogT) if 7;^72 = 1 

- I 1 + A (1 + if 7172 < 1 

where A = 2 D 2 r^(^))- 

7.14 Proof of Theorem 8 

Consider a problem instance with two arms with reward functions $\mu_1(\theta) = \theta^{\gamma}$
and $\mu_2(\theta) = 1 - \theta^{\gamma}$, where $\gamma$ is an odd integer and the rewards are
Bernoulli distributed with $X_{1,t} \sim \mathrm{Ber}(\mu_1(\theta))$ and $X_{2,t} \sim \mathrm{Ber}(\mu_2(\theta))$. Then, the
optimality regions are $\Theta_1 = [2^{-1/\gamma}, 1]$ and $\Theta_2 = [0, 2^{-1/\gamma}]$. Note that $\gamma_2 = 1$
and $\gamma_1 = 1/\gamma$ for this case. Let $\theta^* = 2^{-1/\gamma}$ and let the true value of the global parameter be
$\theta_* = \theta^* + \Delta$ for some $\Delta > 0$. Then, the optimal arm is arm 1 and the one-step loss of
choosing arm 2 is bounded by

{9* + A)'^ - (1 - [9* + A)'^) = 2(61* + A)T' - 1 

= 2{{9*y + Q {9*y-^A + Ci(A)) - 1 

> 2 ( 6 i*)'^-l + 2 Q^ {9*y-^A 
= 272 ^A. 

1—7 

Therefore, there exists oi = 272 t such that one-step loss by choosing arm 2 
is lower bounded by oiA for 9* > 0. Then, we can lower bound the regret as 

Tt{0* + A) >aiAVy9t <= 9*) 

= ^(Pr(0; - 0* < - A) + Pr(0t -9,< - A)) 

= “^(Pr((dt - 9,y < -Ay + Pr((dt - 9,y < -A'^)) 


> (0*r< A^) 

(12) 

+pr((0tr-(0*r<-A^)), 

(13) 
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where the last inequality follows from the fact that {a — b)"^ < — for 7 >= 1. 

Note that informativeness of both arms are the same and the best estimate can 
be found when we ob serve the rewards from the same arm. Therefo re, the 
best estimator of (12) is best estimator of ( |l3[ ) is 

9t = = (1 - ^2.*)" ■ Then, 

rt(r + A) > “^(Pr(Ai,t - 92 < -A^) + ¥r{X2,t - (1 - 92) > A'^)). (14) 

Define two processes vx = Ber(6>J) ® Ber(0* — A''') and ^2 = Ber(0* + 
A^) ® Ber(0j) where ui ® V 2 denotes the product distribution. Let Prj, denotes 
probability associated with distribution v. Then, (14) is equivalent to 

Reg{9* + A, T) > ^ ^ Pr,®* (J* = 2) + Pr,«* (/* = 1), 


where is the t times product distribution of Using well-known lower 
bounding techniques for the minimax risk of hypothesis testing [Tsybakov and| 


Zaiats (2009), we have 


Reg(r + A,r) > ^^exp(-KL«,i/f)), 


(15) 


where 


KL«, O = t(KL(Ber(0:), Ber(0: + A^)) + KL(Ber(0: - A'^), Ber(0:)). 


By using the fact KL(p, q) < 
bound (15) by 

T 

Reg(r+A,T)>^^exp(- 


Rigollet and Zeevi 


( 2010 ), we can further 


2tA2')' 


-h At')(1 -92- At) 


> 


A27-1’ 


where = 077(1 — exp(—16)) for T > 8. Then, by setting A = T , we have 


Reg(r+A,r)> ^(l-exp(-16))ri-JT. 
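
The two-arm instance used in this proof is easy to check numerically. The sketch below assumes the reward functions $\mu_1(\theta) = \theta^{\gamma}$ and $\mu_2(\theta) = 1 - \theta^{\gamma}$ with boundary $2^{-1/\gamma}$, and verifies that the one-step loss of pulling arm 2 at $\theta^* + \Delta$ is at least $2\gamma\, 2^{(1-\gamma)/\gamma} \Delta$, which is the linear lower bound obtained by keeping the first-order term of the binomial expansion. All function names are hypothetical.

```python
def mu1(theta, gamma):
    """Expected reward of arm 1."""
    return theta ** gamma


def mu2(theta, gamma):
    """Expected reward of arm 2."""
    return 1.0 - theta ** gamma


def boundary(gamma):
    """Parameter value at which both arms have expected reward 1/2."""
    return 2.0 ** (-1.0 / gamma)


def optimal_arm(theta, gamma):
    """Index (1 or 2) of the arm with the larger expected reward."""
    return 1 if mu1(theta, gamma) >= mu2(theta, gamma) else 2


def loss_of_arm2(delta, gamma):
    """One-step loss of pulling arm 2 when theta = boundary + delta."""
    theta = boundary(gamma) + delta
    return mu1(theta, gamma) - mu2(theta, gamma)


def linear_lower_bound(delta, gamma):
    """First-order lower bound 2 * gamma * 2^((1 - gamma) / gamma) * delta."""
    return 2.0 * gamma * 2.0 ** ((1.0 - gamma) / gamma) * delta


if __name__ == "__main__":
    g = 3
    for d in (0.01, 0.05, 0.1):
        print(d, loss_of_arm2(d, g), linear_lower_bound(d, g))
```

Small values of $\Delta$ make the two arms nearly indistinguishable while the loss still scales linearly in $\Delta$, which is what allows the lower bound to match the worst-case upper bound in its time order.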
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7.15 Proof of Theorem 9 

By Lemma 1^ and Jensen’s inequality, we have 

^[rt{9l)] <2D2E\\9l-9t\ 


(16) 
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where 


EtiNkAt)f^k'iXk,p,t) 

Tp{t) 


e, = 


where ^k,p{t) = Tp{t). Then, by using Lemmaj^ we have 


Nk,pm Xk,p,t-MK) \NkAt) 


E 


-9t 


< 


Tp{t) 


Let ( be the set of times times that arm k is chosen in round p before time 
t, i.e., 

^k'o t= {t' 2(p - l)Th <t'< 2pTh). 


Clearly, J = Nk,p{t). We have. 


Xk,p± — 




Xk,p{t) 


where AXk,t'] = Pk{9l) for all t' G Define a random variable Xk^t' = 

Xk.t' — Pk{9l ) for all t' G k & X. and p. Observe that {Xk^At'^sl*^ 

a random sequence with AXk,t'] = 0 and Xk^v G almost surely for all 

k G X and p. Then, 


E 


Xk,p,t — k'k(dl) \Nk,p{t) 


< E 


< E 


Et'esl’' iXk,t' Pk{9\)) 

K,p,t 


Xk,p(t) 




+ 


Et'^s:^ \M0l) -Mol)\ 

K,p,t 

Xk,pit) 


Xk,pit) 

where for any t' G Sk,p,t, k G X and p, 


Nk,p{t) 


E 






poo 

1x^0 

pOO 


Pr 


, Xk,t' 


XkA^) 


> cc dx 


pOO 

< / 2exp(—x^A^fc,p(t)) dx 
Jx=0 


NkAty 


where the inequality follows from the Chernoff-Hoeffding bound and 

\ei-oi\<L{2TATr, 


(17) 


(18) 


since for all t,t' G we have — < 2rft. Then, using (171 and (18 1 , the 

expected gap between 9\ and Ot can be bounded as 


E 


\0l - Ot 


Ef=i^iE 


< 


< 




Tp{t) 

Tp{t) 


< Di{{'KK)^Tp{t)-^-ir +2i:)^iL'>'i'^2(2r^/T)“T'iT'=) 


,21 /,x_2i 

^p( 


where the second inequality follows from the fact that (a + h)'^ < + hH 

for a,b > 0 and 0 < 7 < 1 , the third inequality is due to the worst case 
selection process, i.e., Nk,p{t) = Tp{t)/K for all k G 1C, and the fourth inequality 
follows from the fact that Tp{t) > Th- By choosing Th = A, we get the optimal 
h = Q cumulative regret at time T can be bounded as 


1 2 
XlE[ct(6»‘)] + (2D2DI^[{tiK)A 


which concludes the proof.